Random Sampling for Data Intensive Computations

Authors

  • Dinkar Vasudevan
  • Milan Vojnović
Abstract

We consider the estimation of arbitrary range partitionings of data values and the ranking of frequently occurring items based on random sampling, using a small number of samples while meeting a prescribed accuracy. These problems arise in the parallel processing of massive datasets, e.g. as performed in data centers of Internet-scale cloud services and in large-scale scientific computations. Range partitioning is a basic building block of parallel-processing systems based on the map-and-reduce paradigm. For range partitioning, we consider a direct estimation method based on constructing an arbitrary-height histogram and characterize its estimation error. This approach provides substantial savings in constructing unbalanced range partitionings compared with a standard approach based on equi-height histograms; our results extend previous work that was restricted to equi-height histograms. For the problem of ranking frequently occurring items, we lump small-frequency items together, which enables us to obtain tighter bounds that are independent of the total number of distinct items in a dataset. The analysis uses the framework of large deviations, which is well suited to the typically large scale of data in the considered applications. We demonstrate the tightness and benefits of our sampling methods using a large dataset from an operational cloud service that involves data at a scale of hundreds of billions of records. Our results provide insights and inform the design of practical sampling methods.
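As a rough illustration of the range-partitioning idea in the abstract, the following Python sketch estimates partition boundaries for arbitrary (not necessarily equal) target fractions from a uniform random sample. It is a minimal sketch under stated assumptions, not the authors' method or analysis: the function names `estimate_boundaries` and `partition_shares`, the use of plain empirical sample quantiles, and the fixed seed are all hypothetical choices made for this example.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def estimate_boundaries(data, fractions, sample_size, seed=0):
    """Estimate range boundaries so that partition i holds roughly
    fractions[i] of the data (hypothetical helper, not from the paper)."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    rng = random.Random(seed)
    # Uniform sample without replacement, sorted to read off quantiles.
    sample = sorted(rng.sample(data, sample_size))
    # k partitions need k - 1 boundaries, so drop the final cumulative value.
    cumulative = list(accumulate(fractions))[:-1]
    # Each boundary is the empirical sample quantile at a cumulative target.
    return [sample[min(int(c * sample_size), sample_size - 1)]
            for c in cumulative]


def partition_shares(data, boundaries):
    """Return the fraction of the full dataset falling in each range."""
    counts = [0] * (len(boundaries) + 1)
    for x in data:
        # Values equal to a boundary go to the partition on its right.
        counts[bisect_right(boundaries, x)] += 1
    return [c / len(data) for c in counts]


if __name__ == "__main__":
    data = list(range(100_000))
    targets = [0.1, 0.2, 0.7]  # an unbalanced partitioning
    boundaries = estimate_boundaries(data, targets, sample_size=5_000)
    print(boundaries)
    print(partition_shares(data, boundaries))
```

Comparing the output of `partition_shares` with the target fractions gives an empirical view of the estimation error for a given sample size; the error bounds themselves are the subject of the paper and are not reproduced here.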

Similar resources

A fast algorithm for computing the sampling distribution of a statistic from discrete populations

In this work we propose a fast algorithm for computing the exact small sampling distribution of a given statistic, when the population random variable is discrete. The algorithm relies on a recursion on block matrices that describes all possible random samples that can be generated. In this way, the power of modern programming which defines objects in terms of matrices is fully exploited for effe...

X-SRAM: Enabling In-Memory Boolean Computations in CMOS Static Random Access Memories

Silicon based Static Random Access Memories (SRAM) and digital Boolean logic have been the workhorse of the state-of-art computing platforms. Despite tremendous strides in scaling the ubiquitous metal-oxide-semiconductor transistor, the underlying von-Neumann computing architecture has remained unchanged. The limited throughput and energy-efficiency of the state-of-art computing systems, to a l...

Random Sampling: Sorting and Selection

Random sampling techniques have played a vital role in the design of sorting and selection algorithms for numerous models of computing. In this article we provide a summary of sorting and selection algorithms that have been devised using random sampling. Models of computations treated include the parallel comparison tree, the PRAM, the mesh, the mesh with fixed, reconfigurable, and optical buse...

Impact of triplet interference exercises between exercise scheduling methods (random-variable-intensive) in skill development and scoring accuracy for futsal players

Objective: This study aimed to prepare an educational program based on the overlap and integration of the random exercise method with the variable exercise method and the intensive exercise method, adapted to the capabilities of the research sample, and to recognize the effect of the educational program on developing the skill performance and scoring accuracy of the research sample in the futsal...

Average Distance Queries through Weighted Samples in Graphs and Metric Spaces: High Scalability with Tight Statistical Guarantees

The average distance from a node to all other nodes in a graph, or from a query point in a metric space to a set of points, is a fundamental quantity in data analysis. The inverse of the average distance, known as the (classic) closeness centrality of a node, is a popular importance measure in the study of social networks. We develop novel structural insights on the sparsifiability of the dista...

Publication date: 2010